[manpage_begin XMLPARSE n 1.00]
[moddesc   {XML}]
[titledesc {Parser for XML files in Fortran}]
[description]
The XML parser provided by this module has been written entirely in
Fortran, making it possible to read and write XML files without the need
to use mixed-language programming techniques.
[para]
It should be noted that the implementation has a number of limitations
(cf. the section Limitations). The module has the following features:

[list_begin bullet]
[bullet]
Reading an XML-file (within certain limitations) in a stream-oriented
manner.

[bullet]
Writing an XML-file in a stream-oriented manner.

[bullet]
Creating a reading routine that will fill a data structure. The data
structure is described via an XML file and all necessary code to read
files that conform to that structure is generated.

[list_end]

[para]
The module has been implemented in standard Fortran 90. It is the
intention to make it compilable by the F compiler as well, so that
it can be used in conjunction with a wide set of Fortran compilers.
[para]
(It should even be possible to convert the parsing routines to an
equivalent library in FORTRAN 77, though with the availability of
several free Fortran 95 compilers, there seems little need for that.)

[section "PROCEDURES"]
The module defines the following public routines and functions:
[list_begin definitions]

[call [cmd "subroutine xml_open("] [arg info], [arg filename], [arg mustread] )]
Open an XML-file and fill the structure [emph info], so that it can be
used to refer to the opened file.
[nl]
To check whether the file was opened successfully (it may fail to open
for some reason), use the function xml_error().
[nl]
Arguments:
[list_begin definitions]
[arg info] - TYPE(XML_PARSE) structure used to identify the file
[nl]
[arg filename] - CHARACTER(LEN=*) name of the file to be opened
[nl]
[arg mustread] - LOGICAL whether to read the file or to write to it
[list_end]
[nl]

[call [cmd "subroutine xml_close("] [arg info] )]
Close an opened XML-file. If the file was not opened, this routine has
no effect.
[nl]
[arg info] - TYPE(XML_PARSE) structure used to identify the file
[nl]

[call [cmd "subroutine xml_options("] [arg info], ... )]
Set one or more options. These are all defined as optional arguments, so
that the [emph name=value] convention can be used to select an option
and to set its value. The first argument is fixed:
[nl]
[arg info] - TYPE(XML_PARSE) structure used to identify the file
[nl]
All other arguments are optional and include:
[nl]
[list_begin definitions]
[arg ignore_whitespace] - LOGICAL compress the array of strings (remove
empty lines and remove leading blanks) for easier processing
[nl]
[arg no_data_truncation] - LOGICAL if data truncation occurs (too many
lines of data or too many attributes, so that they can not all be stored
in the arrays), this can be marked as an error or not. If the option is
set to [emph true], it is considered an error.
[nl]
[arg report_lun] - INTEGER LU-number of a file to which messages can be
logged (use XML_STDOUT for output to screen)
[nl]
[arg report_errors] - LOGICAL write error messages to the report
[nl]
[arg report_details] - LOGICAL write detailed messages to the report,
useful for debugging
[list_end]
[nl]
Note that these options are off by default. They should be set
after the file has been opened. The reporting options are an exception:
they can be set before an XML file has been opened, because they hold
globally (that is, they are in effect for all reading and writing,
independent of the files).
[nl]
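As a minimal sketch of the [emph name=value] convention (the file name
"example.xml" is only a placeholder for illustration):
[example {
program options_example
    use xmlparse
    implicit none

    type(XML_PARSE) :: info

    ! Open the file for reading and check that this succeeded
    call xml_open( info, 'example.xml', .true. )
    if ( xml_error(info) ) then
        write(*,*) 'Problem opening example.xml'
    endif

    ! Parsing options, selected by keyword
    call xml_options( info, ignore_whitespace  = .true., &
                            no_data_truncation = .true. )

    ! Reporting options (these hold globally)
    call xml_options( info, report_lun    = XML_STDOUT, &
                            report_errors = .true. )

    call xml_close( info )
end program options_example
}]
[nl]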

[call [cmd "subroutine xml_get("] [arg info], [arg tag], [arg endtag], [arg attribs], \
    [arg no_attribs], [arg data], [arg no_data] )]
Read the current tag in the file up to the next one or the end-of-file.
Store the attributes in the given array and do the same for the
character data that may be present after the tag.
[list_begin definitions]
[arg info] - TYPE(XML_PARSE) structure used to identify the file
[nl]
[arg tag] - CHARACTER(LEN=*) string that will hold the tag's name
[nl]
[arg endtag] - LOGICAL indicates whether the current tag has ended or
not
[nl]
[arg attribs] - CHARACTER(LEN=*), DIMENSION(:,:) array of strings that
will hold the attributes given to the tag
[nl]
[arg no_attribs] - INTEGER number of attributes that were found
[nl]
[arg data] - CHARACTER(LEN=*), DIMENSION(:) array of strings that
will hold the character data (one element per line)
[nl]
[arg no_data] - INTEGER number of lines of character data
[list_end]
Note:
[nl]
If an error occurs or end-of-file is found, then use the functions
[emph xml_ok()] and [emph xml_error()] to find out the conditions.
[nl]

[call [cmd "subroutine xml_put("] [arg info], [arg tag], [arg attribs], \
    [arg no_attribs], [arg data], [arg no_data], [arg type] )]
Write the information for the current tag to the file. This subroutine
is the inverse, so to speak, of the subroutine [emph xml_get] that
parses the XML input.
[nl]
For a description of the arguments, other than [emph type]: see above.
[nl]
[arg type] - CHARACTER(LEN=*) string having one of the following values:
[list_begin bullet]
[bullet]
'open' - Write an opening tag with attributes and data (if there
are any). Useful for creating a hierarchy of tags.
[bullet]
'close' - Write a closing tag
[bullet]
'elem' - Write the element data
[list_end]
[nl]
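The following sketch writes a small hierarchy with the three values of
[emph type]. The file name, tag names and attribute are made up for the
example, and it assumes that the first index of [emph attribs]
distinguishes the name (1) from the value (2) of an attribute, which
this document does not state explicitly.
[example {
program write_example
    use xmlparse
    implicit none

    type(XML_PARSE)                   :: info
    character(len=20), dimension(2,1) :: attribs
    character(len=40), dimension(1)   :: data

    attribs = ' '
    data    = ' '

    ! .false. means: open the file for writing
    call xml_open( info, 'grid.xml', .false. )

    ! Opening tag "grid" without attributes or data
    call xml_put( info, 'grid', attribs, 0, data, 0, 'open' )

    ! Element "x" with one attribute and one line of data
    ! (ASSUMPTION: attribs(1,i) is the name, attribs(2,i) the value)
    attribs(1,1) = 'units'
    attribs(2,1) = 'm'
    data(1)      = '1.0 2.0 3.0'
    call xml_put( info, 'x', attribs, 1, data, 1, 'elem' )

    ! Closing tag for "grid"
    call xml_put( info, 'grid', attribs, 0, data, 0, 'close' )

    call xml_close( info )
end program write_example
}]
[nl]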

[call [cmd "logical function xml_ok("] [arg info] )]
Returns whether the parser is still okay (no read errors or
end-of-file).
[nl]
[arg info] - TYPE(XML_PARSE) structure used to identify the file
[nl]

[call [cmd "logical function xml_error("] [arg info] )]
Returns whether the parser has encountered some error (see also the
options).
[nl]
[arg info] - TYPE(XML_PARSE) structure used to identify the file
[nl]

[call [cmd "logical function xml_data_trunc("] [arg info] )]
Returns whether the parser has had to truncate the data or the
attributes.
[nl]
[arg info] - TYPE(XML_PARSE) structure used to identify the file
[nl]

[call [cmd "integer function xml_find_attrib("] [arg attribs], \
   [arg no_attribs], [arg name], [arg value] )]
Convenience function that searches the list of attributes and returns
the index of the sought attribute in the array or -1 if not present.
In that case the argument [emph value] is not set, so that you can use
this to supply a default.
[list_begin definitions]
[arg attribs] - CHARACTER(LEN=*), DIMENSION(:,:) array of strings that
hold the attributes
[nl]
[arg no_attribs] - INTEGER number of attributes that were found
[nl]
[arg name] - CHARACTER(LEN=*) name of the attribute to be found
[nl]
[arg value] - CHARACTER(LEN=*) actual or default value of the attribute
upon return
[list_end]
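[nl]
A short sketch of using the return value and the default. The attribute
names are illustrative, and the layout of [emph attribs] (name in row 1,
value in row 2) is an assumption that is not spelled out in this
document.
[example {
program attrib_example
    use xmlparse
    implicit none

    character(len=20), dimension(2,2) :: attribs
    character(len=20)                 :: value
    integer                           :: idx

    ! ASSUMPTION: attribs(1,i) holds the name, attribs(2,i) the value
    attribs(1,1) = 'name'
    attribs(2,1) = 'x'
    attribs(1,2) = 'type'
    attribs(2,2) = 'integer'

    ! Preset the default; it is kept if the attribute is absent
    value = 'none'
    idx   = xml_find_attrib( attribs, 2, 'units', value )

    if ( idx == -1 ) then
        write(*,*) 'No "units" attribute, using default: ', trim(value)
    else
        write(*,*) 'Attribute "units" found, value: ', trim(value)
    endif
end program attrib_example
}]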

[call [cmd "subroutine read_xml_file_xxx("] [arg filename], \
   [arg lurep], [arg error] )]
Subroutine generated via the method described below to read an XML
file of a particular structure.
[list_begin definitions]
[arg filename] - CHARACTER(LEN=*) name of the XML file to read
[nl]
[arg lurep] - INTEGER LU-number to use for reporting errors (use 0 to
write to the screen; optional)
[nl]
[arg error] - LOGICAL variable that indicates if an error occurred
while reading (optional).
[list_end]

[call [cmd "subroutine xml_process("] [arg filename], \
   [arg attribs], [arg data], \
   [arg startfunc], [arg datafunc], [arg endfunc], \
   [arg lurep], [arg error] )]
Subroutine that reads the XML file and calls three user-defined
subroutines to take care of the actual processing. This is a
routine that implements the so-called SAX approach.
[list_begin definitions]
[arg filename] - CHARACTER(LEN=*) name of the XML file to read
[nl]
[arg attribs] - CHARACTER(LEN=*), DIMENSION(:,:) work array to store the
attributes
[nl]
[arg data] - CHARACTER(LEN=*), DIMENSION(:) work array to store the
character data associated with a tag
[nl]

[arg startfunc] - Subroutine that is called to handle the [emph start]
of a tag:
[example {
    subroutine startfunc( tag, attribs, error )
       character(len=*)                 :: tag
       character(len=*), dimension(:,:) :: attribs
       logical                          :: error
}]
[nl]
If the argument error is set to true (because the tag was unexpected or
something similar), the reading is interrupted and the routine returns.
Only the fact that something was wrong is recorded. You need to use
other means to convey more information if that is needed.

[nl]
[arg datafunc] - Subroutine that is called to handle the [emph "character data"]
associated with a tag:
[example {
    subroutine datafunc( tag, data, error )
       character(len=*)               :: tag
       character(len=*), dimension(:) :: data
       logical                        :: error
}]
[nl]

[arg endfunc] - Subroutine that is called to handle the [emph end]
of a tag:
[example {
    subroutine endfunc( tag, error )
       character(len=*)               :: tag
       logical                        :: error
}]
[nl]
[arg lurep] - INTEGER LU-number to use for reporting errors (use 0 to
write to the screen; optional)
[nl]
[arg error] - LOGICAL variable that indicates if an error occurred
while reading (optional).
[list_end]

[list_end]
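[para]
A minimal sketch of how [emph xml_process] might be used with three
handlers. The module name "demo_handlers", the file name and the sizes
of the work arrays are choices made for this example only. The handlers
print the tag hierarchy and count how often character data is reported.
[example {
module demo_handlers
    implicit none
    integer, save :: depth      = 0
    integer, save :: data_calls = 0
contains

subroutine startfunc( tag, attribs, error )
    character(len=*)                 :: tag
    character(len=*), dimension(:,:) :: attribs
    logical                          :: error

    ! Print the tag name, indented by its nesting level
    write(*,'(2a)') repeat( ' ', 2*max(depth,0) ), trim(tag)
    depth = depth + 1
    error = .false.
end subroutine startfunc

subroutine datafunc( tag, data, error )
    character(len=*)               :: tag
    character(len=*), dimension(:) :: data
    logical                        :: error

    ! Only count the calls; real code would store or convert the data
    data_calls = data_calls + 1
    error      = .false.
end subroutine datafunc

subroutine endfunc( tag, error )
    character(len=*) :: tag
    logical          :: error

    depth = depth - 1
    error = .false.
end subroutine endfunc

end module demo_handlers

program process_example
    use xmlparse
    use demo_handlers
    implicit none

    character(len=40), dimension(2,10) :: attribs
    character(len=80), dimension(100)  :: data
    logical                            :: error

    ! 0 as LU-number: report errors to the screen
    call xml_process( 'example.xml', attribs, data, &
                      startfunc, datafunc, endfunc, 0, error )

    if ( error ) then
        write(*,*) 'An error occurred while reading example.xml'
    else
        write(*,*) 'Calls to datafunc: ', data_calls
    endif
end program process_example
}]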

[section MOTIVATION]
The use of XML-files as a means to store data and more importantly to
transfer data between very disparate applications and organisations has
been growing these last few years. Standard implementations of libraries
that deal with all features of XML or a significant part of them are
available in many languages, but as far as we know there was no
implementation in Fortran.
[para]
One could of course use, say, the well-known Expat library by ... and
provide a Fortran interface, but this is slightly awkward as it forces
one to have a compatible C compiler. More importantly, this introduces
platform dependencies, because the interfacing between Fortran and C
depends strongly on the compilers used. It also introduces a way of
working that is alien to Fortran programmers: Expat requires the
programmer to register a callback function, to be called when some
"event" occurs while reading the file (a begin tag is found, character
data are found and so on).
[para]
The alternative is even more awkward: build a tree of tags and
associated data and ask for these data. One of the first things a
Fortran programmer will want to do with an XML-file is to get all the
information out, so a stream-oriented parsing method is more
appropriate.
[para]
Among the two predominant types of XML-parsing, SAX or stream-oriented
parsing and DOM or object-oriented parsing, the stream-oriented approach
is more suitable to the frame of mind of the average Fortran programmer.
But instead of registering callbacks, this module uses the method known
from, for instance, GNU's getopt() function: parse the data and return
to the caller to have it process the information. The caller calls the
function again and again, letting getopt() take care of the details.
[para]
This is exactly the approach taken by the [emph xmlparse] module:
[example {
    call xml_open(info, ... )

    do while ( xml_ok(info) )
       call xml_get(info, ... ) ! Get the first/next tag
       ... identify the tag (via xml_check_tag for instance)
       ... process the information
    enddo

    call xml_close(info)

    ... proceed with the rest of the program
}]
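[para]
Filled in with the argument lists from the section PROCEDURES, the
skeleton above could look like the following sketch. The file name and
the array sizes are arbitrary choices, and the layout of [emph attribs]
(name in row 1, value in row 2) is an assumption that this document does
not state explicitly.
[example {
program read_example
    use xmlparse
    implicit none

    type(XML_PARSE)                        :: info
    character(len=80)                      :: tag
    logical                                :: endtag
    character(len=80), dimension(1:2,1:20) :: attribs
    integer                                :: no_attribs
    character(len=200), dimension(1:100)   :: data
    integer                                :: no_data
    integer                                :: i

    call xml_open( info, 'example.xml', .true. )
    call xml_options( info, ignore_whitespace = .true. )

    do while ( xml_ok(info) )
        call xml_get( info, tag, endtag, attribs, no_attribs, &
                      data, no_data )
        if ( xml_error(info) ) then
            write(*,*) 'Error while reading example.xml'
            exit
        endif

        if ( endtag ) then
            write(*,*) 'End of tag:   ', trim(tag)
        else
            write(*,*) 'Start of tag: ', trim(tag)
        endif

        ! ASSUMPTION: attribs(1,i) is the name, attribs(2,i) the value
        do i = 1, no_attribs
            write(*,*) '   ', trim(attribs(1,i)), ' = ', trim(attribs(2,i))
        enddo

        do i = 1, no_data
            write(*,*) '   data: ', trim(data(i))
        enddo
    enddo

    call xml_close( info )
end program read_example
}]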

[para]
For convenience, the module does supply the routine [emph xml_process]
that takes three user-defined subroutines to perform the actual
processing. The file will be processed in its entirety.

[section "PARAMETERS AND DERIVED TYPES"]
The module defines several parameters and derived types for use by the
programmer:
[list_begin definitions]
[lst_item XML_BUFFER_LENGTH]
the length of the internal buffer, representing
the maximum length of any individual line in an XML file and the maximum
length for a tag including all its attributes.

[lst_item XML_STDOUT]
a parameter to indicate the standard output (or *) as the file to
write messages to.

[lst_item type(XML_PARSE)]
the data structure that holds information about
the XML file to be read or written. Its contents are partially
accessible via functions such as XML_OK() and XML_ERROR().
[emph Note:] do not use its contents directly, as these may change in
future.

[list_end]

[section "GENERATING A READING ROUTINE"]
Reading an XML file and making sure the data are structured the way
they are supposed to be generally requires a lot of code. This can not
be avoided: you will want to make sure everything you need is there and
anything else is dealt with appropriately.
[para]
There is a way out: by automatically generating the reading routine
you can reduce the amount of manual coding to a minimum. This has two
advantages:
[list_begin bullet]
[bullet]
It is much less work to define the data and their place in an XML file
than it is to encode the reading routine.
[bullet]
It is much less error-prone: the logic is generated for you, so you
need much less testing.
[list_end]
The idea is simple:
[para]
In an XML-file you define the data structure and the way this data
structure should appear in an input XML file for your program.
The process is probably best explained via an example.
[para]
Say, you want to read addresses (a classical example). Each address
consists of the name of the person, street name and the number of
the house, city (let us keep it simple). Of course we have multiple
addresses, so they are stored in an array. Then via the
[emph xmlreader] program you can generate a reading routine that
deals with this type of information.
[para]
The program takes an XML file as input and produces a Fortran 90 module
that reads input files and stores the data in the designated variables.
It also creates a writing routine to write the data to an XML file.
[para]
In our case, we want a derived type to hold the various pieces
that form a complete address and we want an array of that type:
[example {
<typedef name="address_type">
   <component name="person" type="character" length="40" />
   <component name="street" type="character" length="40" />
   <component name="number" type="integer" />
   <component name="city"   type="character" length="40" />
</typedef>
<variable name="address" type="address_type" dimension="1" />
}]

This will produce the following derived type:
[example {
type address_type
   character(len=40) :: person
   character(len=40) :: street
   integer           :: number
   character(len=40) :: city
end type address_type
}]
and a variable "address":
[example {
type(address_type), dimension(:), pointer :: address
}]

The reading routine will be able to read such XML files as the
following:
[example {
<address>
   <person>John Doe</person>
   <street>Wherever street</street>
   <number>30</number>
   <city>Erewhon</city>
</address>
<address>
   ...
</address>
...
}]
If in some address the number was forgotten, the reading routine will
report this, as by default all variables and components in a derived
type must be present.
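[para]
A program that uses the generated code might then look like the sketch
below. The module name "xml_data_addresses" and the routine name
"read_xml_file_addresses" are assumptions for illustration (the routine
follows the read_xml_file_xxx pattern from the section PROCEDURES);
check the generated source for the actual names.
[example {
program print_addresses
    use xml_data_addresses     ! ASSUMED name of the generated module
    implicit none

    logical :: error
    integer :: i

    ! ASSUMED routine name; 0 means: report errors to the screen
    call read_xml_file_addresses( 'addresses.xml', 0, error )

    if ( error ) then
        write(*,*) 'Problem reading addresses.xml'
        stop
    endif

    ! The array "address" is filled by the reading routine
    do i = 1, size(address)
        write(*,'(a)')      trim(address(i)%person)
        write(*,'(a,1x,i5)') trim(address(i)%street), address(i)%number
        write(*,'(a)')      trim(address(i)%city)
    enddo
end program print_addresses
}]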

[para]
Here is a more detailed description of the XML files accepted by the
[emph xmlreader] program:
[list_begin bullet]
[bullet]
Use the [emph comment] tag to insert comments in the input file to
[emph xmlreader] (or in the input to the resulting reading routines).

[bullet]
The [emph options] tag can be used to influence the generated code:
[list_begin bullet]
[bullet]
The attribute "strict" determines whether unknown tags are
regarded as an error ([emph strict="yes"]) or not ([emph strict="no"],
the default).
[bullet]
The attribute "globaltype" is used to indicate that all variables should
belong to a single derived type, whose name defaults to the name of the
file. Use the "typename" attribute to set the name to a different value.
[list_end]

[bullet]
If you want to group the tags for several variables without
introducing a special derived type, use the [emph placeholder] tag.
It requires an additional pair of opening and closing tags around the
data: any tags defined between the placeholder's opening and closing
tags must appear within the corresponding tags in the input file for
the resulting program.
[example {
<placeholder tag="grid">
    <variable x ...>
    <variable y ...>
</placeholder>
}]

[bullet]
[emph variable] tags correspond directly to module variables.
They are used to declare these variables and to generate the code that will
read them.
[nl]
Variable tags can appear anywhere except within a type definition.
Variables can be of a previously defined derived type or of a
primitive type.
[example {
<variable name="x" type="integer" default="1" />
}]
Variables can have a number of attributes:
[list_begin bullet]
[bullet]
Required attributes:
[list_begin definitions]
[arg name] - the name of the variable in the actual program
[nl]
[arg type] - the type of the variable
[nl]
[arg length] - for character types only, the length of the string
[list_end]
[nl]

[bullet]
Optional attributes:
[list_begin definitions]
[arg default] - the default value to be used if information is missing
[nl]
[arg dimension] - the number of dimensions (up to 3), gives rise to a
pointer component
[nl]
[arg shape] - the fixed size of an array; if this is present, the
number of dimensions is taken from this attribute.
[nl]
[arg tag] - the name of the tag that holds the data (defaults to
the name of the variable)
[list_end]
[nl]

[bullet]
Basic types for the variables include:
[list_begin definitions]
[arg integer] - a single integer value
[nl]
[arg integer-array] - a one-dimensional array of integer values (the
values must appear between an opening and ending tag)
[nl]
[arg real] - a single-precision real value
[nl]
[arg real-array] - a one-dimensional array of real values (the
values must appear between an opening and ending tag)
[nl]
[arg double] - a double-precision real value
[nl]
[arg double-array] - a one-dimensional array of double-precision values
(the values must appear between an opening and ending tag)
[nl]
[arg logical] - a single logical value (represented as "T" or "F")
[nl]
[arg logical-array] - a one-dimensional array of logical values
(the values must appear between an opening and ending tag)
[nl]
[arg word] - a character string as can be read via list-directed input
(if it should contain spaces, surround it with single or double quotes)
[nl]
[arg word-array] - a one-dimensional array of strings
(the values must appear between an opening and ending tag)
[nl]
[arg line] - a character string as can be read from a single line
of text (via the '(A)' format)
[nl]
[arg line-array] - a one-dimensional array of strings, read as
individual lines between the opening and closing tag
[nl]
[arg character] - a character string (synonym for "line")
[nl]
[arg character-array] - a one-dimensional array of character strings,
synonym for line-array
[list_end]
[list_end]
[nl]

[bullet]
Type definitions ([emph typedef]) allow the [emph xmlreader] program to
define the derived types that you want to use in your reader.
[nl]
The [emph "typedef"] tag may only contain [emph "component"] tags. These
are synonymous with [emph "variable"] tags and are subject to the same
restrictions.

[list_end]

[para]
Future versions may also include options for:
[list_begin bullet]
[bullet]
Adding code to handle certain data in a particular way
[bullet]
Version checking (so that an input file is explicitly identified
as being of a particular version of the software)
[list_end]

[section EXAMPLES]
The directory "examples" contains some example programs.
[list_begin bullet]
[bullet]
The [emph tst_grid] program demonstrates how to create a reader
for an array of "grids", each consisting of two integers.
[bullet]
The [emph tst_menu] program uses a more elaborate structure,
a menubar with menus and each menu having an array of items.
Items in a menu can have a submenu. This leads to an XML file with
multiple hierarchical layers.
[bullet]
The [emph tst_process] program uses the [emph xml_process] routine to
read in an XML file (a "docbook" file) and turn it into an HTML file for
viewing.
[list_end]


[section LIMITATIONS]
Basic limitations:
[list_begin bullet]
[bullet]
The lines in the XML-file should not exceed 1000 characters. For tags
that span more than one line, the limit holds for all the lines together
(without leading or trailing blanks).

[bullet]
There is no support for DTDs or namespaces, XSLT, XPath and
other more advanced features around XML.

[bullet]
There is currently no support for the object-oriented approach. It is up
to the application to store the information that is needed, while the
parsing is going on.

[bullet]
No support (yet) for a single quote as delimiter

[bullet]
No support (yet) for conversion of escape sequences (&gt; for instance)

[bullet]
The parser may not handle malformed XML-files properly

[bullet]
The parser does not (yet) handle different line-endings properly (that
is: reading XML-files that were written under MS Windows in a UNIX or
Linux environment)

[list_end]
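[para]
Until escape sequences are converted by the parser itself, an
application can convert them after reading. The sketch below is a
hypothetical helper, not part of the module: the names
"xml_text_utils" and "decode_entities" are made up for this example.
It replaces the five predefined XML entities in a string and handles
the ampersand last, so that text which was escaped twice is decoded
only once.
[example {
module xml_text_utils
    implicit none
contains

! Replace the predefined XML entities in "string" by plain characters
subroutine decode_entities( string )
    character(len=*), intent(inout) :: string

    character(len=6), dimension(5), parameter :: entity = &
        (/ '&lt;  ', '&gt;  ', '&quot;', '&apos;', '&amp; ' /)
    character(len=1), dimension(5), parameter :: plain = &
        (/ '<', '>', '"', "'", '&' /)
    integer :: i, k, length, start

    do i = 1, size(entity)
        length = len_trim(entity(i))
        start  = 1
        do
            k = index( string(start:), entity(i)(1:length) )
            if ( k <= 0 ) exit
            k      = start + k - 1
            string = string(1:k-1) // plain(i) // string(k+length:)
            ! Continue after the replacement, so it is not decoded again
            start  = k + 1
        enddo
    enddo
end subroutine decode_entities

end module xml_text_utils

program decode_example
    use xml_text_utils
    implicit none

    character(len=40) :: text

    text = 'a &lt; b &amp;&amp; c &gt; d'
    call decode_entities( text )
    write(*,'(a)') trim(text)      ! prints: a < b && c > d
end program decode_example
}]
The routine could be applied to each element of the [emph data] array
and to the attribute values returned by [emph xml_get].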

[section "RELEASE NOTES"]
This document belongs to [emph "version 1.00"] of the module.
[para]
History:
[para]
[emph "version 0.1:"] Proof of concept, August 2003
[para]
A very preliminary version meant to show that it is indeed possible to
read and write XML files using Fortran only. It was published on the
comp.lang.fortran newsgroup and generated enough interest to encourage
further development.
[para]
[emph "version 0.2:"] First public release, August 2003
[para]
After some additional testing with practical XML-files, a number of bugs
were found and solved, and several enhancements were made:
[list_begin bullet]
[bullet]
Handling attributes (especially when tags span more than one line and
correctly handling the case that too many attributes are present).
[bullet]
Options for parsing and error handling added, as well as functions to
check the status.
[bullet]
Revision of the API, for more uniform names (prefix: xml_)
[bullet]
Setting up the documentation (this document in particular)
[list_end]
[para]
[emph "version 0.3:"] Improvements, September 2003
[list_begin bullet]
[bullet]
Added the function xml_error()
[bullet]
Implemented the report options
[bullet]
Corrected a bug in xml_close (causing an infinite loop in the
test program).
[bullet]
Revised the test program to run through a number of test
files.
[list_end]
[para]
[emph "version 0.4:"] Corrected xml_put(), October 2003
[list_begin bullet]
[bullet]
Adjusted the interface and implementation of the subroutine xml_put().
It will now produce correct and reasonable-looking XML files.
[bullet]
Added a test program, tstwrite.f90, for this.
[list_end]

[para]
[emph "version 0.9:"] Added new approach, October 2005
[list_begin bullet]
[bullet]
Changes to the interface and implementation of the subroutine xml_put(),
from a patch by [emph cinonet].
[bullet]
Added a program, xmlreader, to generate complete reading routines for
particular XML files ([emph cf.] [sectref "GENERATING A READING ROUTINE"])
[list_end]

[para]
[emph "version 0.94:"] Gradually expanding the capabilities, June 2006
[list_begin bullet]
[bullet]
Added a routine [emph xml_process] that enables you to use an
event-based approach like in the famous Expat library.
[bullet]
Added the option [emph strict] and the tag [emph placeholder].
[bullet]
Corrected a number of bugs associated with the xmlreader program
[list_end]

[para]
[emph "version 0.97:"] Added the following capabilities to the
xmlreader program since 0.94, June 2007
[list_begin bullet]
[bullet]
Support for the [emph shape] option
[bullet]
Defaults for both components of a derived type and for
independent variables.
[bullet]
The generated reading routine takes care of elements that have
attributes and character data now. The character data is treated as if
it were an attribute with the name "value"
[bullet]
Several bugs corrected in the xmlreader program
[list_end]

[para]
[emph "version 1.00:"] Added the following capabilities to the
xmlreader program since 0.97, April 2008
[list_begin bullet]
[bullet]
Generate a writing routine to write the data to an XML file
[list_end]
The project now also contains a first version of a program to convert an
XSD file to a file accepted by the xmlreader program. This is called
"xsdconvert".

[section "TO DO"]
The following items remain on the "to do" list:
[list_begin bullet]
[bullet]
Adding checks for truncation of strings (attribute names/values too
long, data lines too long; now only the number is checked).
[bullet]
Documenting details about structures and parameters that may be of
interest.
[list_end]

[keywords Fortran XML parsing]

[manpage_end]
